Abstract. The combinatorial basis of entropy, given by Boltzmann, can be written $H = N^{-1} \ln W$,
where $H$ is the dimensionless entropy, $N$ is the number of entities and $W$ is the number of ways in
which a given realization of a system can occur (its statistical weight). This can be broadened to
give generalized combinatorial (or probabilistic) definitions of entropy and cross-entropy:
$H = \kappa(\phi(W) + C)$ and $D = -\kappa(\phi(P) + C)$, where $P$ is the probability of a given realization, $\phi$ is a
convenient transformation function, $\kappa$ is a scaling parameter and $C$ an arbitrary constant. If $W$ or
$P$ satisfy the multinomial weight or distribution, then using $\phi(\cdot) = \ln(\cdot)$ and $\kappa = N^{-1}$, $H$ and $D$
asymptotically converge to the Shannon and Kullback-Leibler functions. In general, however, W or 
P need not be multinomial, nor may they approach an asymptotic limit. In such cases, the entropy or 
cross-entropy function can be defined so that its extremization ("MaxEnt" or "MinXEnt"), subject 
to the constraints, gives the "most probable" ("MaxProb") realization of the system. This gives a 
probabilistic basis for MaxEnt and MinXEnt, independent of any information-theoretic justification. 

This work examines the origins of the governing distribution P. These include: (a) frequentist- 
like models; (b) symmetry models; (c) prior MinXEnt models; (d) Kapur-Kesavan inverse models; 
and (e) game theoretic models. The combinatorial definition and MaxProb are consistent with 
these different approaches, and the notion of probabilistic inference, yet offer greater utility than 
traditional MaxEnt / MinXEnt based on the Shannon and Kullback-Leibler functions. 
1. INTRODUCTION

Fifty years ago, Jaynes [1] gave the maximum entropy method (MaxEnt), based on the
Shannon entropy [2]:

$$H_{Sh} = -\sum_{i=1}^{s} p_i \ln p_i, \tag{1}$$

where $p_i$ is the (posterior) probability of occurrence of the $i$th distinguishable state
within a system, from s such states. In the MaxEnt method, one maximizes the Shannon 
entropy of a system, subject to its constraints, to determine the "least informative" or 
"maximally noncommittal" probability distribution representing the system. From its 
inception, MaxEnt was advanced as a generic method of inference for the solution of 
indeterminate problems of all kinds, underpinned by information theory, not merely as 
an extension of mechanics [1, 3, 4, 5, 6]. MaxEnt was later extended into the maximum
relative entropy, minimum divergence or minimum cross-entropy method (MinXEnt),
involving extremization of the Kullback-Leibler measure [7, 8]:



$$D_{KL} = \sum_{i=1}^{s} p_i \ln \frac{p_i}{q_i}, \tag{2}$$

which allows for unequal prior probabilities $q_i$. Since that time, MinXEnt and its
subsidiary MaxEnt have been successfully applied to the analysis of a vast number of
phenomena, throughout most fields of human study [e.g. 6, 9, 10, 11], and can rightly be
regarded as one of the most important of all human discoveries. 

It must be emphasised, however, that the cross-entropy and entropy concepts which 
underpin MinXEnt and MaxEnt are themselves subject to many different philosophical 
interpretations. Dominant explanations include the axiomatic basis outlined by Shannon
[2], and the information-theoretic ("bits" of information) and coding basis, recognized
by Szilard [12] and Shannon [2] [c.f. 13]. These bases led Jaynes, in particular, to
consider the Shannon and Kullback-Leibler functions to be the only logically consistent 
measures of uncertainty, and thus the only ones suitable for analysis. This view has 
been challenged by many researchers, on the grounds that the above two measures are 
too narrowly defined and/or inapplicable to many situations. For example, over the past 
85 years, many alternative entropy and divergence functions have been introduced [e.g.
14-30]; in most cases, these are
incompatible with the Shannon and Kullback-Leibler functions, but have proved useful 
for the analysis of specific classes of systems. Can such measures be explained by some 
broader philosophical framework? How should we choose the "correct" cross-entropy or 
entropy function for a given problem? The fact that such questions remain unanswered 
indicates the need for a unifying philosophical framework, which encompasses (and 
explains) such alternative entropy measures and their connections to information theory. 

This study examines one such framework: the combinatorial (or probabilistic) basis
of entropy, first given 130 years ago by Boltzmann [31] and subsequently promoted by
Planck [32]. This involves the maximization of a governing probability distribution $P$ or
weight $W$ of a system; this can be viewed as a generalized principle of probabilistic inference,
aptly described by Vincze and Grendar & Grendar as the maximum probability
("MaxProb") principle [33, 34]. It also leads to generalized definitions of cross-entropy
and entropy, based purely on probability theory [35]. In this study, specific attention is
paid to the origins of the governing distribution P, including (a) frequentist-like models 
(e.g. ball-in-box or urn models); (b) symmetry models; (c) prior MinXEnt models; (d) 
Kapur-Kesavan inverse models; and (e) game theoretic models. It is shown that the com- 
binatorial basis is consistent with these different approaches, but is more soundly based 
and offers greater utility than traditional MaxEnt / MinXEnt based on the Shannon and 
Kullback-Leibler functions. 



2. THE COMBINATORIAL BASIS 

Owing to a tremendous confusion in terminology - especially amongst physicists - it is 
first necessary to rigorously define several important terms [c.f. 35]. An entity is here
taken to be a discrete particle, object or agent within a system, which acts separately but 
not necessarily independently of the other entities present. A system is a collection of
entities with a defined boundary, subject to various constraints, which may or may not
be open to the exchange of specified entities or substances with an external environment.
The entity therefore constitutes the unit of analysis of a system.

FIGURE 1. Definition of terms used in the combinatorial basis of entropy and cross-entropy.

Now consider a simple "ball-in-box" model of a system, shown in Figure 1, in which
N distinguishable entities (balls) are allocated to s distinguishable non-degenerate states 
(boxes). As shown: 

• A state refers to each different category or element of a system (e.g. energy levels, sides
of a die or alphabetic symbols). The states are therefore properties of, or associated 
with, each individual entity in the system. 

• A configuration is a distinguishable permutation or pattern of entities amongst the 
states of a system (a complexion, microstate or sequence). A configuration is therefore 
a property of the system as a whole. 

• A realization is each aggregated arrangement of entities amongst the states of a system, 
as specified by some rule, for example by the number of entities in each state (a macro- 
state, outcome or type). In general, a realization will constitute a set of configurations, 
since several configurations could give the same realization (see Figure 1).
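
As a small illustration of these definitions, the following sketch enumerates every configuration of N = 4 distinguishable entities amongst s = 3 distinguishable states, and groups the configurations into realizations by their occupancy numbers. The function names are hypothetical, and the comparison with the multinomial coefficient assumes the non-degenerate ball-in-box model of Figure 1.

```python
from collections import Counter
from itertools import product
from math import factorial, prod


def realizations(N, s):
    """Map each realization {n_i} to the number of configurations it contains."""
    weights = Counter()
    # A configuration assigns a state (0..s-1) to each distinguishable entity.
    for config in product(range(s), repeat=N):
        occupancy = tuple(config.count(i) for i in range(s))
        weights[occupancy] += 1
    return weights


def multinomial_weight(occupancy):
    """Statistical weight W = N! / prod(n_i!) of a realization."""
    return factorial(sum(occupancy)) // prod(factorial(n) for n in occupancy)


weights = realizations(N=4, s=3)
for occupancy, count in sorted(weights.items()):
    # The enumerated count agrees with the multinomial weight.
    assert count == multinomial_weight(occupancy)
    print(occupancy, count)
```

In this small case, the s^N configurations collapse onto a much smaller set of realizations, each carrying its own statistical weight.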

There is such confusion in and sloppy usage of the terms state, microstate and macro- 
state - severely impairing understanding - that the last two terms should be avoided. In 
the following, the states are indexed $i = 1, \ldots, s$ (which may be multivariate); $n_i$ denotes
the number of entities in the $i$th state; $q_i$ and $p_i = n_i/N$ respectively denote the prior
and posterior probabilities of an entity being in the $i$th state; and each realization is denoted
$\{n_i\}$. Notwithstanding other philosophical differences with Jaynes, the "subjective
Bayesian" definition of probabilities, as assignments based on what we know, is adopted 
here GUI. 

For the analysis of probabilistic systems, it is possible to delineate a principle which 
stands out from all others: the maximum probability ("MaxProb") principle [31, 32, 33,
34, 35]. This can be stated as:

"A system can be represented by its realization of highest probability." 



A realization can only be denoted $\{p_i\}$ in the asymptotic limits $N \to \infty$ and $n_i \to \infty, \forall i$, since $\{p_i\}$
discards information about the value of $N$.



This seemingly trivial statement provides a powerful principle for probabilistic infer- 
ence, which is independent of any information-theoretic considerations. This is critical, 
since in any contradiction between information theory and probability theory - for ex- 
ample, between the distributions inferred by each approach - probability theory must 
triumph. Like MinXEnt or MaxEnt based on the Kullback-Leibler or Shannon mea- 
sures, MaxProb is a method of inference (inductive reasoning), which does not give 
certainty in its predictions. Unlike them, however, MaxProb is founded solely on prob- 
ability theory. Indeed, MaxProb does not depend upon any asymptotic limits (a feature 
of the "frequentist" definition of probability, in which probabilities must correspond to 
measurable frequencies JTJ [6)); it can therefore be applied to systems containing finite 
numbers of entities [29, 30].
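
To make the finite-N character of MaxProb concrete, the sketch below searches all realizations of a small system for the one of highest probability. It assumes a multinomial governing distribution with strictly positive priors, and no constraint other than the total number of entities; the function names are hypothetical.

```python
from math import lgamma, log


def compositions(N, s):
    """Yield every realization {n_i}: non-negative integers n_1..n_s summing to N."""
    if s == 1:
        yield (N,)
        return
    for first in range(N + 1):
        for rest in compositions(N - first, s - 1):
            yield (first,) + rest


def log_multinomial(n, q):
    """ln P of a realization under the multinomial distribution (assumed model)."""
    N = sum(n)
    return lgamma(N + 1) + sum(
        ni * log(qi) - lgamma(ni + 1) for ni, qi in zip(n, q)
    )


def maxprob(N, q):
    """Return the realization of highest probability by exhaustive search."""
    return max(compositions(N, len(q)), key=lambda n: log_multinomial(n, q))


q = (0.5, 0.3, 0.2)
print(maxprob(10, q))  # N*q is integer-valued here
print(maxprob(4, q))   # N*q = (2.0, 1.2, 0.8) is not a realization
```

For N = 4 the product N q is not integer-valued, so the MaxProb realization must be found among the admissible occupancy numbers rather than read off from the priors.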

Allied to MaxProb is a generalized form of the second law of thermodynamics: 

"A system will tend towards its most probable realization." 

This provides a purely probabilistic rationale for use of the MaxProb principle, indepen- 
dent of thermodynamics. In effect, if we adopt MaxProb as a principle for probabilistic 
inference, the above statement is its corresponding ergodic principle, which (on average) 
explains its success. Of course - as expressed by Jaynes [1] - the concept of ergodicity 
is not needed for the purpose of inference, since in the absence of other information, we 
are fully justified in conducting inference without it. 

The MaxProb principle also leads to the combinatorial definition of entropy, first 
given by Boltzmann [31] and Planck [32]. This can be written as:

$$H = N^{-1} \ln W, \tag{3}$$

where W is the number of ways in which a given realization can occur, referred to as its 
statistical weight. Maximization of the entropy H of a system, subject to its constraints, 
therefore selects the realization of highest weight W (the logarithmic function being 
a monotonic transformation, which does not alter the position of the extremum). Eq.
(3) can be extended to give generalized combinatorial (or probabilistic) definitions of
cross-entropy and entropy [35]:

$$D = -\kappa(\phi(P) + C), \qquad H = \kappa(\phi(W) + C), \tag{4}$$

where $P = P(\{n_i\} | \{q_i\}, N, s, I)$ is the probability of a given realization, subject to the
prior probabilities $\{q_i\}$, number of entities $N$, number of states $s$ and background information
$I$; $\phi$ is a convenient monotonic transformation function; $\kappa$ is a scaling parameter;
and $C$ is an arbitrary constant. This perspective is summarised in Figure 2. If $P$ or $W$ satisfy
the multinomial distribution or weight:

$$P = N! \prod_{i=1}^{s} \frac{q_i^{n_i}}{n_i!}, \qquad W = N! \prod_{i=1}^{s} \frac{1}{n_i!}, \tag{5}$$

then by taking $\phi(\cdot) = \ln(\cdot)$, $\kappa = N^{-1}$ and the asymptotic limits $N \to \infty$ and $n_i \to \infty, \forall i$
(the "Stirling approximation"), $D$ and $H$ converge respectively to the Kullback-Leibler
and Shannon functions (1)-(2) [33, 34]. This provides a (well-known) justification for
these functions, and their corresponding MinXEnt and MaxEnt principles, as a special 
case, independently of the arguments used in information theory. 
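
The convergence can be checked numerically. The sketch below evaluates the combinatorial entropy and cross-entropy for the multinomial case with ln as the transformation function, scaling 1/N and C = 0, and compares them with the Shannon and Kullback-Leibler functions as N grows at fixed proportions. The function names and the example distributions are illustrative assumptions.

```python
from math import lgamma, log


def combinatorial_entropy(n):
    """H = (1/N) ln W, with the multinomial weight W = N! / prod(n_i!)."""
    N = sum(n)
    return (lgamma(N + 1) - sum(lgamma(ni + 1) for ni in n)) / N


def combinatorial_cross_entropy(n, q):
    """D = -(1/N) ln P, with the multinomial distribution P."""
    N = sum(n)
    ln_P = lgamma(N + 1) + sum(
        ni * log(qi) - lgamma(ni + 1) for ni, qi in zip(n, q)
    )
    return -ln_P / N


def shannon(p):
    return -sum(pi * log(pi) for pi in p if pi > 0)


def kullback_leibler(p, q):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = (0.5, 0.3, 0.2)  # posterior proportions n_i / N (example values)
q = (0.2, 0.3, 0.5)  # prior probabilities (example values)
for N in (10, 1000, 100000):
    n = [round(pi * N) for pi in p]
    print(N,
          combinatorial_entropy(n) - shannon(p),
          combinatorial_cross_entropy(n, q) - kullback_leibler(p, q))
```

The printed differences shrink towards zero as N increases, while for N = 10 the combinatorial and asymptotic functions differ noticeably.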
FIGURE 2. Schematic flowchart of the combinatorial basis of entropy and cross-entropy.



In general, however, P or W need not be multinomial, nor may they approach an 
asymptotic limiting form. In such cases, extremization of the cross-entropy or entropy 
defined by (4), subject to the constraints, gives the most probable (MaxProb) realization
of the system (in the non-asymptotic case, due to the effect of quantization,
extremization gives an "attractor" distribution which lies close to but not necessarily
equal to the MaxProb realization [35]). In consequence, the combinatorial definitions
(4) remain consistent with the rules of probability theory, whilst inference using the
Kullback-Leibler or Shannon measures may lead to inconsistencies. The combinatorial 
(or probabilistic) definitions are therefore more broadly applicable than those derived 
from information-theoretic considerations.

In the foregoing discussion, the astute reader will notice that there may be many 
different ways to classify the entities and states of a system, and hence to identify its 
configurations; and many different ways to group the configurations into realizations. We 
are therefore led to the "subjective" (or "observer-dependent") view of the entropy and 
cross-entropy concepts, a sentiment vocally defended by Jaynes 0]|. This was succinctly 
expressed by Tseng and Caticha [36]:

"Entropy is not a property of a system ... [it] is a property of our description of a 
system." 

The fact that the thermodynamic entropy S is always defined in the same manner, 
allowing thermodynamicists to make consistent calculations, should not fool the reader 
into believing that the entropy concept is "objective".



It is important that the symbol S be devoted exclusively to the thermodynamic entropy, since it is a 
special case of - but is distinct from - the dimensionless Shannon entropy (1).



3. ORIGINS OF THE GOVERNING DISTRIBUTION 



We now consider the origins of the governing distribution P or weight W used in the 
combinatorial formulation. The problem of justifying a cross-entropy or entropy func- 
tion is now replaced by a deeper problem, of how to justify its governing distribution. 
The ubiquity of the Kullback-Leibler and Shannon measures, in many circumstances, 
therefore leads to the question: why the multinomial distribution? This question, and the 
choice of P, is examined from five different perspectives. 

(a) Frequentist-like Models 

In this approach, one simply asserts a governing distribution P or weight W as a 
probabilistic model of the system under consideration. One may have strong grounds, 
based on prior knowledge of a problem, for such an assertion; in any case, we should 
have no "fear of failure" of this method (in Jaynes' words), since if the model gives 
unsuccessful predictions, we have learnt that it is incorrect. Many such models are 
available from classical probability theory, for example the ball-in-box models of the 
type represented in Figure 1. Since these arose from frequentist studies, they can be
termed "frequentist-like" models, although used here for the purpose of inference. 

In the case discussed previously, in which distinguishable balls are allocated to distinguishable
boxes in accordance with a set of constant prior probabilities (see Figure 1),
one obtains the multinomial distribution (5), and hence the Kullback-Leibler cross-entropy
and Shannon entropy functions in the Stirling limits. However, different assumptions
lead to different model distributions. If the asymptotic limits are not applied, then
from (5), one obtains a non-asymptotic cross-entropy function [c.f. 29]:

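$$\begin{aligned}
-D^{x}_{KL} = N^{-1} \ln P &= N^{-1} \Big\{ \ln N! + \sum_{i=1}^{s} n_i \ln q_i - \sum_{i=1}^{s} \ln n_i! \Big\} \\
&= \sum_{i=1}^{s} \Big\{ p_i N^{-1} \ln N! + p_i \ln q_i - N^{-1} \ln[(p_i N)!] \Big\}
\end{aligned} \tag{6}$$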

This is applicable to systems with finite (small) $N$. Minimisation of $D^{x}_{KL}$, subject to the
usual constraints $\sum_{i=1}^{s} n_i = N$ and $\sum_{i=1}^{s} n_i f_{ri} = N \langle f_r \rangle$, for $r = 1, \ldots, R$, where $f_{ri}$ is the
$r$th function of each state $i$ and $\langle f_r \rangle$ is its mathematical expectation, gives the "most
probable" distribution [c.f. 29]:

$$p_i^{*} = N^{-1} \Big[ \psi^{-1}\Big( N^{-1} \ln N! + \ln q_i - \lambda_0 - \sum_{r=1}^{R} \lambda_r f_{ri} \Big) - 1 \Big], \tag{7}$$

where $\psi^{-1}(\cdot)$ is the inverse digamma function, and $\lambda_0$ and $\lambda_r$ are the Lagrangian multipliers. Eq. (7) can be viewed as the "attractor"
for systems with finite N, which differs from the attractor given by traditional MinXEnt. 
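
A minimal numerical sketch of this finite-N attractor follows, for the simplest case in which only the normalization constraint is imposed (R = 0). It takes Eq. (7) in that case to read p_i = (1/N)[ψ⁻¹((1/N) ln N! + ln q_i − λ_0) − 1], and solves for the multiplier λ_0 by bisection. The digamma routine (upward recurrence followed by the standard asymptotic series), the bisection brackets and all function names are assumptions made for illustration, not part of the original analysis.

```python
from math import lgamma, log


def digamma(x):
    """psi(x) for x > 0: shift x upward by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def inverse_digamma(y, iterations=100):
    """Solve psi(x) = y by bisection; psi is increasing on x > 0."""
    lo, hi = 1e-9, 1.0
    while digamma(hi) < y:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if digamma(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def attractor(q, N, iterations=100):
    """Finite-N attractor for priors q with normalization as the only constraint."""
    base = lgamma(N + 1) / N       # (1/N) ln N!

    def p_of(lam0):
        return [(inverse_digamma(base + log(qi) - lam0) - 1.0) / N for qi in q]

    lo, hi = -50.0, 50.0           # sum(p) decreases as lam0 increases
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(p_of(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    return p_of(0.5 * (lo + hi))


q = (0.5, 0.3, 0.2)
print(attractor(q, 10))       # differs from q for small N
print(attractor(q, 10**6))    # approaches q for large N
```

For small N the attractor is shifted away from the priors, whereas for large N it approaches the traditional MinXEnt solution, which in this unconstrained case is simply the prior distribution.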

If the states are considered to contain $g_i$ distinguishable, degenerate sub-states within
each distinguishable state $i$, then three cases have been examined historically: (i) distinguishable
entities; (ii) indistinguishable entities; and (iii) indistinguishable entities,
with a maximum of one entity in each state. The resulting distributions were given by
Brillouin [37, 38] as, respectively:



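$$P_{MB} = \frac{N!}{G^N} \prod_{i=1}^{s} \frac{g_i^{n_i}}{n_i!}, \tag{8}$$

$$P_{BE} = \frac{N!\,(G-1)!}{(G+N-1)!} \prod_{i=1}^{s} \frac{(g_i+n_i-1)!}{n_i!\,(g_i-1)!}, \tag{9}$$

$$P_{FD} = \frac{N!\,(G-N)!}{G!} \prod_{i=1}^{s} \frac{g_i!}{n_i!\,(g_i-n_i)!}, \tag{10}$$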

where $G = \sum_{i=1}^{s} g_i$ is the total degeneracy. The truncated weights and entropy functions
corresponding to these distributions, referred to respectively as the Maxwell-Boltzmann,
Bose-Einstein and Fermi-Dirac distributions [e.g. 16, 17, 18, 19, 20, 37, 38, 39, 40, 41],
played an important role in the development of quantum theory. In the non-asymptotic
case, the resulting entropy functions appear to have profound information-theoretic
consequences [29, 30].
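
As a consistency check on the three degenerate cases, the sketch below evaluates them in their standard textbook forms, assuming equiprobable sub-states, and confirms that each sums to unity over all realizations of a small system. The binomial-coefficient expressions used here are an assumed (but algebraically equivalent) way of writing the factorial forms, and the function names are hypothetical.

```python
from math import comb, factorial, prod


def compositions(N, s):
    """Yield every realization {n_i}: non-negative integers n_1..n_s summing to N."""
    if s == 1:
        yield (N,)
        return
    for first in range(N + 1):
        for rest in compositions(N - first, s - 1):
            yield (first,) + rest


def p_mb(n, g):
    """Distinguishable entities (Maxwell-Boltzmann): multinomial with q_i = g_i / G."""
    N, G = sum(n), sum(g)
    return factorial(N) / G**N * prod(gi**ni / factorial(ni) for ni, gi in zip(n, g))


def p_be(n, g):
    """Indistinguishable entities (Bose-Einstein)."""
    N, G = sum(n), sum(g)
    return prod(comb(gi + ni - 1, ni) for ni, gi in zip(n, g)) / comb(G + N - 1, N)


def p_fd(n, g):
    """Indistinguishable entities, at most one per sub-state (Fermi-Dirac)."""
    N, G = sum(n), sum(g)
    # comb(gi, ni) is zero when ni > gi, so over-filled states get zero probability.
    return prod(comb(gi, ni) for ni, gi in zip(n, g)) / comb(G, N)


g = (2, 3, 1)  # example degeneracies
N = 3
for name, dist in (("MB", p_mb), ("BE", p_be), ("FD", p_fd)):
    total = sum(dist(n, g) for n in compositions(N, len(g)))
    print(name, total)
```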

Recently, a quite different ball-in-box model was considered, in which distinguish- 
able entities are allocated to indistinguishable, equally degenerate states. The statistical 
weight of each realization $\{n_i\}$ can be expressed as [42]:



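$$W_{D:I(g)} = \frac{N!}{\prod_{i=1}^{k} n_i! \, \prod_{j=1}^{N} r_j!} \prod_{i=1}^{k} \sum_{m=1}^{\min(g, n_i)} \left\{ {n_i \atop m} \right\}, \tag{11}$$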



where there are $k$ non-empty states amongst the $s$ states; $g$ is the degeneracy of each
state; $\left\{ {n \atop m} \right\}$ is a Stirling number of the second kind; and $r_j$ is the number of occurrences
of integer $j$ in the set $\{n_i\}$. The combinatorial entropy corresponding to (11),
$H_{D:I(g)} = N^{-1} \ln W_{D:I(g)}$, does not appear to have a straightforward asymptotic form, except in the
non-degenerate case $g = 1$ with $k = s$, when it reduces to the Shannon entropy.

Closely related to but distinct from ball-in-box models are urn models, in which a 
container (urn) is set up with a total of $M$ balls, made up of $m_i$ balls of each color $i$.
Balls are then drawn from the urn in accordance with some sampling scheme, recorded
and returned to the urn (or the urn modified in some way), and the sampling repeated
[c.f. 43, 44]. The asymptotic limits of an infinitely large urn ($M \to \infty$ and $m_i \to \infty, \forall i$),
and an infinitely large (smaller) sample ($N \to \infty$ and $n_i \to \infty, \forall i$), are usually applied.
Although quite different to the ball-in-box model of Figure 1, an urn model with simple
replacement also yields the multinomial distribution [43, 44]. Urn models involving
the drawing of balls without replacement, or double replacement, lead respectively to
the Fermi-Dirac and Bose-Einstein distributions [43]. Urn models also readily permit
the construction of systems in which the prior probabilities are not independently and
identically distributed (non-iid sampling): e.g. the Polya distribution, in which after every
draw, the ball is returned, and $c$ balls of the same color are also added [45, 46, 47, 48]:

$$P = \frac{N!}{\prod_{i=1}^{s} n_i!} \, \frac{\prod_{i=1}^{s} m_i (m_i + c) \cdots (m_i + (n_i - 1)c)}{M (M + c) \cdots (M + (N-1)c)}. \tag{12}$$



Substituting the initial prior probabilities $q_i = m_i/M$ and parameter $\beta = N/M$, this gives
analytic cross-entropy measures in the non-asymptotic and asymptotic cases [48]. The
resulting "most probable" distribution is intermediate between the Bose-Einstein and 
Fermi-Dirac distributions, with physical applications. 
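
The role of the parameter c can be seen in a short sketch of the multivariate Polya distribution in its standard form. The function names and the example urn are illustrative assumptions; the special cases c = 0 (simple replacement), c = -1 (no replacement) and c = +1 (double replacement) correspond to the sampling schemes described above.

```python
from math import factorial, prod


def compositions(N, s):
    """Yield every realization {n_i}: non-negative integers n_1..n_s summing to N."""
    if s == 1:
        yield (N,)
        return
    for first in range(N + 1):
        for rest in compositions(N - first, s - 1):
            yield (first,) + rest


def polya(n, m, c):
    """Probability of drawing n_i balls of each color i from an urn initially
    holding m_i balls of color i, with c extra balls of the drawn color added
    after each draw (c = -1 means the drawn ball is not returned)."""
    N, M = sum(n), sum(m)
    coeff = factorial(N) // prod(factorial(ni) for ni in n)
    numer = prod(mi + j * c for ni, mi in zip(n, m) for j in range(ni))
    denom = prod(M + j * c for j in range(N))
    return coeff * numer / denom


m = (3, 2, 1)  # example urn contents
N = 3          # sample size (must not exceed sum(m) when c = -1)
for c in (-1, 0, 1, 2):
    total = sum(polya(n, m, c) for n in compositions(N, len(m)))
    print(c, total)
```

With c = 0 the expression reduces to the multinomial distribution with priors m_i/M, and with c = -1 to the multivariate hypergeometric distribution.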

(b) Symmetry-Based Arguments 

One may also choose a governing distribution on the basis of symmetry arguments 
(related to the "principle of insufficient reason"). For a system made up of tosses of 
a coin, it is rational to consider the sampling to follow the binomial distribution, with 
equal prior probabilities of 1/2 for each face, due to the symmetry of the states (there being
no information to suggest that one state should be preferred). Alternatively, as suggested 
by David Blower at MaxEnt07, one can obtain a binomial distribution by the symmetry 
of all possible models in the model space (assigning a uniform prior to the models, 
over the entire spectrum from an all-head to an all-tail model, there being no reason 
to prefer any model). Applied to systems with more than two states, either argument 
leads to the multinomial distribution. In this respect, the multinomial distribution plays 
a role somewhat analogous to a central limit theorem (a "central model theorem"), a 
point which deserves greater mathematical attention; this may be the reason for the 
ubiquity of the Kullback-Leibler and Shannon measures. Without symmetry, however, 
the argument breaks down, and one must adopt some other method to identify the 
governing distribution. 

(c) Prior MinXEnt Models 

A third origin of the governing distribution P is as a result of the application of 
MinXEnt at a higher level, for example to the set of systems within which the actual 
system resides. For example, the multinomial distribution can be obtained by MinXEnt 
based on the Kullback-Leibler cross-entropy, subject to a multinomial prior and mean 
constraints on each variate [11]. This might then be imposed as a lower-level governing
distribution. One can in fact envisage a hierarchy of governing and "most probable" 
distributions, at different levels of description. In a complex system, in which there is 
bidirectional feedback, the result will be a mosaic of interconnected probabilistic models 
(with thanks to the discussion by Tony Bell at MaxEnt07). 

(d) Kapur-Kesavan Inverse Models 

The governing distribution P can also be obtained by extension of the arguments of 
Kapur and Kesavan [11], in which one works backwards from an observed probability
distribution $\{p_i^*\}$, prior distribution $\{q_i\}$ (if available) and any constraints, to derive
the measure of cross-entropy or entropy applicable to a system. By unravelling of 
the asymptotic limits, this could (at least in principle) be extended to determine the 
governing distribution of the system. This avenue of research has not been examined in 
detail, and deserves greater attention. 

(e) Game Theoretic Models 

The final method considered here is to derive the governing distribution of a system by 
analysis of a code-length game between the system ("Nature") and the observer [49, 50].
For a multivariate system of iid random variables, which take discrete values, this yields
the multinomial distribution at game-theoretic equilibrium [49]. As in case (c), this could
then be imposed as the governing distribution at a lower level of description. 



4. CONCLUSIONS 



This study examines the MaxProb principle, in which a system is represented by its 
distribution of highest probability. This can be interpreted as a generalized method of 
probabilistic inference, which does not provide certainty in its predictions, yet is always 
consistent with the rules of probability theory. In contrast, inference using the Kullback- 
Leibler cross-entropy or Shannon entropy functions, in cases in which the governing 
distribution is not multinomial and/or does not satisfy the asymptotic limits, can lead 
to inconsistencies. The MaxProb principle also gives rise to generalized combinatorial 
definitions of cross-entropy and entropy, an extension of the idea given by Boltzmann 
130 years ago. The cross-entropy or entropy can therefore be defined so that its extrem- 
ization, subject to the constraints, gives the "most probable" ("MaxProb") realization of 
the system. This provides a purely probabilistic basis for MaxEnt and MinXEnt, which 
is independent of any information-theoretic justification. 

This work examines the origins of the governing distribution P, including by (a) 
frequentist-like models (e.g. ball-in-box or urn models); (b) symmetry models; (c) prior 
MinXEnt models; (d) Kapur-Kesavan inverse models; and (e) game theoretic models. 
It is shown that the combinatorial definition and MaxProb are consistent with these 
different approaches, and the "subjective Bayesian" definition of probability, yet are more
broadly based and offer greater utility than traditional MaxEnt / MinXEnt based on the
Shannon and Kullback-Leibler functions. 